FIGURE 2.2
Given s = 1, QN = 0, QP = 3: A) quantizer output and B) gradient of the quantizer
output with respect to the step size s for LSQ, or with respect to a related parameter
controlling the width of the quantized domain (equal to s(QP + QN)) for QIL [110] and
PACT [43]. The gradient employed by LSQ is sensitive to the distance between v and each
transition point, whereas the gradient employed by QIL [110] is sensitive only to the
distance from the quantizer clip points, and the gradient employed by PACT [43] is zero
everywhere below the clip point. Here, we demonstrate that networks trained with the LSQ
gradient reach higher accuracy than those trained with the QIL or PACT gradients of prior work.
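The behavior plotted in panels A and B can be reproduced with a short sketch, given below in NumPy and assuming the LSQ quantizer v̂ = round(clip(v/s, −QN, QP)) · s introduced earlier in this chapter; the function names are illustrative rather than part of any reference implementation.

```python
import numpy as np

# Illustrative sketch of the figure's setting (s = 1, QN = 0, QP = 3),
# assuming the LSQ quantizer vhat = round(clip(v/s, -QN, QP)) * s.

def lsq_forward(v, s, q_n, q_p):
    """Quantizer output vhat (panel A)."""
    v_bar = np.clip(v / s, -q_n, q_p)
    return np.round(v_bar) * s

def lsq_step_size_grad(v, s, q_n, q_p):
    """Gradient of vhat with respect to s (panel B): equal to the signed
    distance from v/s to its nearest quantization level inside the quantized
    range, and to the clip value (-QN or QP) outside it."""
    ratio = v / s
    grad = np.where((-q_n < ratio) & (ratio < q_p), np.round(ratio) - ratio, 0.0)
    grad = np.where(ratio <= -q_n, -q_n, grad)
    grad = np.where(ratio >= q_p, q_p, grad)
    return grad

v = np.linspace(-1.0, 4.0, 11)
print(lsq_forward(v, s=1.0, q_n=0, q_p=3))         # staircase output of panel A
print(lsq_step_size_grad(v, s=1.0, q_n=0, q_p=3))  # piecewise gradient of panel B
```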
2.2.3 Step Size Gradient Scale
It has been demonstrated that good convergence during training is achieved when the
ratio of average update magnitude to average parameter magnitude is roughly consistent
across all weight layers in a network. Setting the learning rate appropriately prevents
updates that are too large, which cause repeated overshooting of local minima, and updates
that are too small, which lead to slow convergence. Based on this reasoning, each step size
should also have its update magnitude proportional to its parameter magnitude, in the same
way as the weights. Therefore, for a network trained on a loss function L, the ratio
\[
R = \frac{\nabla_s L}{s} \bigg/ \frac{\|\nabla_w L\|}{\|w\|},
\tag{2.11}
\]
should be close to 1, where ∥z∥ denotes the l2-norm of z. However, as precision increases, the
step size parameter is expected to be smaller (due to finer quantization), and the step size
updates are expected to be larger (due to the accumulation of updates from more quantized
items when computing its gradient). To address this, the step size loss is multiplied by a
gradient scale g: for the weight step size, g = 1/√(NW QP), and for the activation step size,
g = 1/√(Nf QP), where NW is the number of weights in a layer and Nf is the number of
features in a layer.
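In practice, this scaling can be folded into the computation graph. The following sketch, assuming a PyTorch-style implementation with an illustrative grad_scale helper and example layer sizes, leaves the forward value of the step size unchanged while scaling its gradient by g:

```python
import torch

# Sketch of the step size gradient scale; helper name and layer sizes are
# illustrative, not part of any particular library.

def grad_scale(s: torch.Tensor, g: float) -> torch.Tensor:
    """Return a tensor equal to s in the forward pass whose gradient with
    respect to s is scaled by g in the backward pass."""
    return (s - s * g).detach() + s * g

n_w = 4608            # N_W: number of weights in the layer (example value)
n_f = 64 * 56 * 56    # N_f: number of activation features in the layer (example value)
q_p = 7               # Q_P of the layer's quantizer

g_weight = 1.0 / (n_w * q_p) ** 0.5   # g = 1 / sqrt(N_W * Q_P) for the weight step size
g_act    = 1.0 / (n_f * q_p) ** 0.5   # g = 1 / sqrt(N_f * Q_P) for the activation step size

s_w = torch.tensor(0.05, requires_grad=True)
s_w_scaled = grad_scale(s_w, g_weight)   # use s_w_scaled inside the weight quantizer;
                                         # gradients reaching s_w are scaled by g_weight
```

The same effect could be obtained by rescaling the step size gradient directly before the optimizer update; the detach-based form above simply folds the scale into the computation graph.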
2.2.4 Training
LSQ trains the model quantizers by making the step sizes learnable parameters, with their
loss gradient computed using the quantizer gradient described earlier, while the other model
parameters are trained with conventional techniques. A common method of training
quantized networks [48] is employed, in which full precision weights are stored and updated,
while quantized weights and activations are used for the forward and backward passes. The
gradient through the quantizer round function is computed using the straight-through
estimator [9], so that
\[
\frac{\partial \hat{v}}{\partial v} =
\begin{cases}
1, & \text{if } -Q_N < v/s < Q_P, \\
0, & \text{otherwise},
\end{cases}
\tag{2.12}
\]
and stochastic gradient descent is used to update parameters.
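The following sketch, assuming PyTorch and an illustrative lsq_quantize helper, puts these pieces together: full precision weights are retained for the update, the round function is applied with the straight-through estimator of Eq. (2.12), and gradients flow to both the weights and the step sizes.

```python
import torch

# Sketch of the training scheme described above; lsq_quantize and the layer
# dimensions are illustrative, not a reference implementation.

def lsq_quantize(v: torch.Tensor, s: torch.Tensor, q_n: int, q_p: int) -> torch.Tensor:
    """Quantize v with learnable step size s. The round is applied with the
    straight-through estimator of Eq. (2.12): its gradient is treated as 1
    inside the clip range and 0 outside."""
    v_bar = torch.clamp(v / s, -q_n, q_p)
    v_q = (v_bar.round() - v_bar).detach() + v_bar   # forward: round; backward: identity
    return v_q * s

# Full precision master weights are stored and updated; quantized values are
# used in the forward and backward passes.
w_fp = torch.randn(16, 8, requires_grad=True)   # full precision weights
s_w  = torch.tensor(0.05, requires_grad=True)   # weight step size (illustrative init)
s_x  = torch.tensor(0.10, requires_grad=True)   # activation step size (illustrative init)

x   = torch.relu(torch.randn(4, 8))             # non-negative activations
w_q = lsq_quantize(w_fp, s_w, q_n=8, q_p=7)     # e.g. 4-bit signed weights
x_q = lsq_quantize(x, s_x, q_n=0, q_p=15)       # e.g. 4-bit unsigned activations

loss = (x_q @ w_q.t()).pow(2).mean()            # placeholder loss for illustration
loss.backward()                                 # gradients reach w_fp, s_w and s_x
# A stochastic gradient descent step then updates w_fp, s_w and s_x directly.
```

Note that only the round operation is detached; the step size therefore still receives the quantizer gradient described earlier through the clip and rescaling operations.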